North-East Visualization and Analytics Center (NEVAC) Team

VAST 2008 Challenge
Grand Challenge

Authors and Affiliations

Don Pellegrino, Drexel University, don@drexel.edu [PRIMARY contact]
Chaomei Chen, Drexel University, Chaomei.Chen@ischool.drexel.edu [Faculty advisor]
Alan MacEachren, The Pennsylvania State University, maceachren@psu.edu [Faculty advisor]
Prasenjit Mitra, The Pennsylvania State University, pmitra@ist.psu.edu [Faculty advisor]
Chi-Chun Pan, The Pennsylvania State University, cpan@ist.psu.edu
Anthony Robinson, The Pennsylvania State University, arobinson@psu.edu
Michael Stryker, The Pennsylvania State University, mzs114@psu.edu
Chris Weaver, The Pennsylvania State University, cweaver@psu.edu [Faculty advisor]

Student team: YES

Tools

Improvise
by Chris Weaver
http://www.personal.psu.edu/cew15/improvise/index.html

Improvise is a research tool designed and built by team member Chris Weaver.  Implemented in Java, Improvise provides a fully graphical interface to support the development of custom visualizations.  With assistance from Chris Weaver, team members developed custom visualizations for each of the datasets.  These were used to perform analysis for each mini-challenge.  An additional visualization was built to explore the associative network for the integration dataset.  Improvise was the primary tool used for exploration of the mini-challenge datasets.

Maple
by Waterloo Maple Inc.
http://www.maplesoft.com

Maple was used heavily for data transformations and numerical analysis.

Pajek
by Vladimir Batagelj and Andrej Mrvar.
http://pajek.imfm.si/doku.php

Pajek was used for analysis of the associative network.

SPSS
by SPSS Inc.
http://www.spss.com

While Maple was the primary tool for exploratory algorithm work, SPSS was used to cluster the wiki revisers.

Oculus GeoTime
by Oculus Info Inc.
http://www.oculusinfo.com/SoftwareProducts/GeoTime.html

GeoTime was used for a view of the cell call data from temporal and spatial perspectives.

TWiki
by Peter Thoeny
http://twiki.org

TWiki was used as a central reporting tool and for collaboration throughout the analytical process.

Adobe Connect Professional
by Adobe Systems Incorporated
http://www.adobe.com/products/acrobatconnectpro/

Adobe Connect was used for online collaboration and screen sharing during meetings of the distributed team.

Microsoft Office
by Microsoft
http://office.microsoft.com

Microsoft Excel was used lightly for ad-hoc analysis and first-cut views of the data.  As much of the source data used Microsoft formats (PowerPoint, MHT, Word), Office tools provided views of the raw data and instructions.

Two Page Summary: NO

ANSWERS

Grand-1: Based on ALL the data available (i.e. using the data from all 4 mini–challenges) what is the social network of the Paraiso movement at the end of the time period?

GrandNodes.txt
GrandLinks.txt

Grand-2: What name or names can be associated with individual activities?

Activities

Names

2002 Ferdinando Catalano publishes El Paraiso Manifesto with El Tiempo Press establishing the Paraiso movement with him as the leader.

Ferdinando Catalano

2002 Pedro Vidro publishes To Grow, To Love, To Live as a Family with El Tiempo Press establishing a commitment to the cause and a relationship with Ferdinando Catalano.

Ferdinando Catalano; Pedro Vidro

2005-08-01 Eduardo Catalano and Jesus Vidro make their first attempt for the US in a Go Fast boat but are intercepted by the USCG vessel Assured.

Eduardo Catalano; Jesus Vidro

2006-06-08 The core members of the Paraiso movement engage in activity off of the Isla Del Sueno.

David Vidro; Juan Vidro; Jorge Vidro; Estaban Catalano; Ferdinando Catalano

2006-07-04 Eduardo Catalano and Jesus Vidro make their second attempt for the US in a Go Fast boat but they are intercepted by the USCG vessel Pompano.

Eduardo Catalano; Jesus Vidro

2006-09-04 Wiki post referencing prosecution of Paraiso by Belgians posted by Angelgasperi and Barfly2001.

 

2006-09-19 Wiki post reading “GUNNED DOWN SIX DOCTORS AND NURSES IN COLD BLOOD” by Alphanzo.

 

2006-09-19 Wiki post referencing a confrontation between Paraiso members and the Department of Health by Edemir.

Paraiso Movement; Department of Health

2006-11-12 Wiki post referencing “Catalano is dead” by Danielrengelm.

Catalano

2007-02-23 Eduardo Catalano and Jesus Vidro land on an island near Cancun, Mexico on their third attempt using a Rustic vessel.

Eduardo Catalano; Jesus Vidro

2007-08 A small IED was set off at a Miami, FL Department of Health Building.

Suspects/Witnesses

Ramon Katalanow (likely Catalano); Lindsey Bowles; Maxwell Lopez ; Karissa Graham; Carlos Vidro

 

Possible Casualties

Gale Welsh; Max Valdez; Cleveland Jimenez; Francisco Salter; Fawn Sparks; Lottie Staley; Phil Marin; Cleveland Hutchinson; Dian Crum; Lavon Lockhart; Rosario Oakley; Morton Kilgore

Grand-3: What is the geographical range of the Paraiso Movement and how it changes over time?

“Nothing to provide: The answer to this question and how you arrived at it should be combined with the answer to the next question which is more general.”

Grand-4: How do the major beliefs of the Paraiso movement affect their activities?

The Paraiso movement is organized around the family unit and male authority.  Members believe that disputes between households “should be settled between the Heads of Family themselves, with no interference from the government or police (Paraiso Manifesto, 2002).”  They also believe that “medical professional are not allowed to treat any Household member without the express permission from the Head of the House (Paraiso Manifesto, 2002).”  These beliefs could lead to escalating violence in disputes that are unmediated as they are to be handled directly between Heads of Family.  These beliefs could also lead to disputes between the government and medical professionals and members of the movement.  As a male must be the Head of Family these beliefs could also lead to violence between male heads of the family stemming from disputes amongst other members of the family.  Another tenet of the movement is that girls are to be educated in the home.  This could cause conflict in areas where home schooling is not pervasive.

The Paraiso movement began on the small Caribbean Isla Del Sueno in 2002, but Coast Guard data shows that it has spread to Florida in the United States and to Mexico.  Comments on the Wikipedia article also reference activities in Belgium from September 4, 2006.  Other comments between August and December of 2006 make reference to Spain, Canada, Miami and Philadelphia.  The “Paraiso Manifesto (2002)” Wikipedia article reports a spread to other Caribbean islands as well.  Running FactXtractor on the wiki edits reveals the following geographical locations: Philadelphia, Miami, Spain, US, Mexico, Canada and Belgium.

In 2005 relatively few migration events occurred, and landings were concentrated along the Florida Keys.  In 2006 the Florida Keys were still a primary destination; however, landings extended northwards along the western Florida coast north of Tampa.  In 2006, no events are reported on the East Coast north of Miami.  By 2007, the Florida Keys had proportionally fewer events and landings were more evenly distributed on both the East and West coasts of Florida.  Together these developments indicate that the Paraiso movement is growing from its roots on the Isla Del Sueno to neighboring regions in the Caribbean, and then on to Florida and other parts of the U.S., as well as Mexico, Canada and Western European countries such as Spain and Belgium.

Supporters and members of the movement actively maintain Wikipedia articles on the topic.  It appears that many of these editors want to present a public image of the movement as being a legitimate religious organization that is not involved in illegal activities.  They actively dispute claims that the organization is a cult or criminal organization.

1) Video

Click here to view the video. (Depending on your browser plugins, you may need to download the video to your desktop instead.)

(An alternate, large version of the video is located here. This version is not included in the zip file.)

2) Debrief

The Paraiso movement began on the Isla Del Sueno in 2002 with the publication of the “Paraiso Manifesto” by Ferdinando Catalano.  The Paraiso ideology is based on supreme authority of the male head of household.  This authority supersedes the authority of the government.  It is also held that “disputes between Houses should be settled between the Heads of the Family themselves, with no interference from the government or police (Paraiso Manifesto, 2002).”  The ideology also prevents medical professionals from treating household members without permission from the Head of the House.  Along with Ferdinando Catalano, Pedro Vidro is also an influential part of the movement, having published “To Grow, To Love, To Live as a Family” in 2003 with the same publisher as Ferdinando Catalano.

The beliefs of the Paraiso movement can lead to behaviors that conflict with governmental authority.  The belief that disputes should be handled without interference from the government or police can lead to escalating violence from family feuds.  Medical care provided to sick children at schools is another source of conflict, as this is not allowed by the movement.  Wiki edits on September 4, 2006 make reference to news articles reporting prosecutions of Paraiso by the Belgians.  The Isla Del Sueno government has been cracking down on the movement and this has caused a mass migration from the island to the United States.

Analysis of calling patterns on Isla Del Sueno between June 1, 2006 and June 10, 2006 reveals that the leaders of the movement organized an event that occurred between June 7 and June 8, 2006.  It appears that the main organizer of the event was David Vidro, using the phone with ID 1.  There is a shift in calling activity within the movement over this time, from the northern part of the island to the southern part.  The drop-off in call activity within the network on the island during June 8, 2006 indicates that activity moved off of the island.  From this it can be inferred that the Paraiso movement in general, and David Vidro in particular, is involved in activity to get people off of the island.  This behavior may be explained by the crackdown on the movement by the local government causing an interest in migration.

Analysis of the Wikipedia edits reveals a polarized debate between supporters and members of the Paraiso movement, who see it as a legitimate religious organization that is not involved in illegal activities, and critics of the movement, who compare it to a cult and cite violent activities.  Wiki posts from September 19, 2006 reference a violent confrontation between Paraiso members and the Department of Health.  Shortly after this, on November 12, 2006, a post makes reference to a dead Catalano.  If this reference is interpreted literally to indicate that a member of the Catalano family had died, it might be related to the violent events reported on September 19, 2006.  The death of a member of the Catalano family during a confrontation with the Department of Health would necessarily instigate a dispute between the head of the Catalano family and the family of the opponent.  The head of the Catalano family would also be required to settle the dispute himself.  Such an event would therefore provide a strong motive for violent activity against the Department of Health or one of its members to be carried out by the head of the Catalano family.

In August of 2007 an improvised explosive device was set off at a Miami, FL Department of Health building.  Analysis of the movements of RFID tags assigned to individuals in the building immediately before, during and after the event lends suspicion to Ramon Katalanow as the bomber.  It is notable that Katalanow can be converted to Catalano in two operations, changing the K to a C and removing the w.  If the events of September 19, 2006 did involve a dispute between a member of the Catalano family and the Department of Health, then it is consistent with an understanding of Paraiso beliefs that the Catalano family would retaliate themselves.  It is also notable that Carlos Vidro was present at the time.  The history of friendship between the Catalanos and the Vidros indicates that Carlos would likely be aware of the planning of the event or even be an accomplice.  Eduardo Catalano and Jesus Vidro had attempted to migrate from the Isla Del Sueno to Florida twice.  Each time they traveled together.  They were successful on their third attempt, this time landing in a rustic vessel on an island near Cancun, Mexico.

3) Detailed Answer

Project Inception

North-East Visualization and Analytics Center (http://www.geovista.psu.edu/NEVAC/) includes collaborators from the Pennsylvania State University in State College, PA and Drexel University in Philadelphia, PA.  At the beginning of the project the team recognized the need for a computer supported collaborative work (CSCW) environment.  Adobe Connect was used to supplement conference calls with synchronous collaboration.  A TWiki instance and project space was created for asynchronous collaboration.  A wiki environment was deliberately selected to mimic the Intellipedia environment ("Gov't unveils a Wikipedia for spies - Analysts can add and edit content on government's classified Web site," 2006).  Table 1 shows the CSCW environment organized according to the “typology of collaborative situations” from Figure 2.6 in (Thomas & Cook, 2005).

 

                  Same Time          Different Time
                  (Synchronous)      (Asynchronous)

Same Place        -                  TWiki

Different Place   Adobe Connect      TWiki

Table 1: CSCW Environment

With the CSCW environment in place, team members were encouraged to explore data on their own for the first few weeks of the project.  The intent of this initial phase of the project was to generate hypotheses as per “Strategies for Generating and Evaluating Hypotheses” in (Heuer, 1999).  Heuer proposed four strategies: Situational Logic, Applying Theory, Comparison with Historical Situations and Data Immersion.  From a “Situational Logic” perspective the evacuation event became the situation of focus.  It was initially hypothesized that the Paraiso organization was responsible for the bombing and that the dataset would reveal one or more bombers, victims and witnesses as per the mini-challenge questions.  It was also hypothesized that the other data sets would provide evidence relative to the bombing.  As the team did not have experience with terrorist events, members were not aware of relevant theory pertaining to the problem.  Therefore the “Applying Theory” strategy was not utilized.  The “Comparison with Historical Situations” strategy was also not fruitful due to the novelty of the dataset to the team.  It also appeared that there was little similarity between this year’s contest and previous VAST datasets.  “Data Immersion” proved to be the primary strategy used by team members.  Each member spent the first few weeks exploring the individual datasets with their favorite tools.

Elaboration

Following a few weeks of “Data Immersion” work the team reconvened to develop an overall strategy.  As this year’s challenge was broken down into mini-challenges it was natural to divide the work by allowing individuals to focus on mini-challenges.  Team members then immersed themselves deeper in the data for specific mini-challenges by building custom visualizations of the data with Improvise.  Improvise is a meta-visualization system in that it does not provide a pre-conceived visualization interface.  Instead the tool starts with a blank screen, allowing the user to add the dataset and visualization widgets that seem most appropriate to the data at hand.  The tool is particularly well suited to the “Data Immersion” strategy as the tool allows the user to evolve the visualization concurrently with the user’s evolving understanding of the data.  Weekly team meetings were held to discuss findings from each mini-challenge sub-team.  Findings and ideas were added to the wiki to support collaboration.  Activities such as manually coding a sample of the wiki editors as “for” or “against” the movement provided an entry point to the analytical process (Analysis – Discussion\Analysis – Discussion.html).

Construction

Data Transformations

Each of the mini-challenge Improvise visualizations includes its own data loading and transformation modules.  These were developed in parallel with the Improvise visualizations.  Separately, custom procedures were written in Maple to support statistical analysis, algorithm development and transformation into the file formats necessary for commercial tools.  The data provided by the contest was inventoried and sorted by type (Analysis – Overview\Analysis - Overview.html).  As a first step to the integrated or commercial tool data analysis, each mini-challenge formatted dataset was loaded and transformed into a basic matrix data structure.  Documentation, text and image data were merely printed to serve as reference throughout the process.  The procedures written for the formatted data transformations were re-factored into a library (Deinosuchus.lib) and are available at Deinosuchus\Deinosuchus.html.  This library contains the following procedures:

·       WritePajekSemanticNetwork
Write the integrated dataset and hypotheses to a network in Pajek format.

·       ToEpoch
Convert timestamp to integer using minutes as the unit and Jan. 1, 2006 as the epoch.  Based on code from Joe Riel on MaplePrimes at http://www.mapleprimes.com/node/309. Format is fixed to %R, %e %B %Y.

·       LoadCellCallsSymbolic
Read the VAST 2008 cell call data into memory using string data types.

·       LoadCellCallsNumeric
Read the VAST 2008 cell call data into memory using numeric data types.

·       LoadWikiEditsPage
Read the VAST 2008 wiki edits page data into memory.

·       ParseWikiEditsPage
Parse the Wiki Edits Page data into tuples.

·       LoadParsedWikiEditsPage
Load and parse the Wiki Edits Page.

·       LoadMigrantData
Load and parse the Migrant Data, converting from XML format to a Matrix.

·       LoadOccupantsRFIDPathways
Load the Occupants RFID Pathway Data.

·       LoadOccupantsRFIDAssignments
Load the Occupants RFID Assignments.

·       GetXMLText
Get XML text from an XML Element accounting for special characters.
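As an illustration of the ToEpoch procedure listed above, the following Python sketch converts a timestamp to whole minutes since Jan. 1, 2006.  It is not the Maple code in Deinosuchus.lib: the function name to_epoch_minutes and the sample timestamps are assumptions, and the Python format string is only an approximation of the %R, %e %B %Y format.

```python
from datetime import datetime

# Epoch named in the ToEpoch description: Jan. 1, 2006.
EPOCH = datetime(2006, 1, 1)

def to_epoch_minutes(stamp: str) -> int:
    """Return whole minutes between EPOCH and a stamp such as '14:07, 3 September 2006'.

    Assumption: '%H:%M, %d %B %Y' stands in for '%R, %e %B %Y'.
    Month names (%B) are parsed in the current locale, assumed English here.
    """
    parsed = datetime.strptime(stamp.strip(), "%H:%M, %d %B %Y")
    return int((parsed - EPOCH).total_seconds() // 60)

# Hypothetical usage: one day plus 12 hours 30 minutes after the epoch.
offset = to_epoch_minutes("12:30, 2 January 2006")
```

Using a single integer unit keeps timestamps from the different datasets directly comparable once they sit in the same matrix.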

Loading and transforming the data requires less than 15 seconds on a typical laptop.  Implementation is given in: Analysis – Timeline\Timelines.html.  After the data was loaded a timeline was created to give a basic overview of the temporal characteristics of the dataset: timeline.html

Figure 1: All data unified into matrices.

Wiki Analysis

The simple matrix model provides a common framework for further analysis.  Algorithms were then implemented to take advantage of the specific properties of each dataset.  Customized analysis was performed on the cell calls dataset (Analysis – Calls\Analysis - Calls.html).  Algorithm development included implementation of the measures of controversy described in (Brandes & Lerner, 2008).  Using this measure it is possible to profile the revision times for the wiki page of interest.  These measures provide an overall sense of the activity levels as shown in Figure 2 and Figure 3.  Implementation of the algorithms for controversy analysis is available in Analysis – Wiki Edits Page\Analysis – Wiki Edits Page.html.

Figure 2: Activity profile of wiki revisers.

Figure 3: Wiki revision activity with number of edits on the y and day of data on the x.
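A per-day activity count of the kind plotted in Figure 3 can be sketched in a few lines of Python.  This is an illustration only, not the team's Maple implementation; it assumes revision times are already expressed as integer minutes since the epoch, and the function name and sample values are hypothetical.

```python
from collections import Counter

MINUTES_PER_DAY = 24 * 60

def edits_per_day(revision_minutes):
    """Count revisions per day index, given minute offsets from the epoch."""
    return Counter(minute // MINUTES_PER_DAY for minute in revision_minutes)

# Hypothetical revision times: two on day 0, three on day 1, one on day 3.
counts = edits_per_day([0, 30, 1500, 1600, 1700, 4400])
```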

Following the visualization technique described in (Brandes & Lerner, 2008) a plot of the primary controversies is created as shown in Figure 4.  This robust visualization shows that revisers can be loosely clustered into four separate camps.  The most vocal revisers for each point of view are shown on the top, bottom, left and right of the visualization.  The primary controversy is being played out by VictoriaV and Rm99.  VictoriaV is a defender of the Paraiso movement as a legitimate religious organization that is not involved in any illegal activities.  Rm99 holds the view that Paraiso is a cult and that it is involved in violent and illegal activities.

Figure 4: The Wheel of Controversy, with lines representing conflicting points of view, weighted by intensity of conflict.  Revisers with similar viewpoints cluster together.

A simple export (transform) of the identification of the controversy and the measure of intensity can be used with Microsoft Excel to explore the revision comments as shown in Figure 5.

Figure 5: Analysis of the most prominent controversy between VictoriaV and Rm99.

Another simple transform is performed to interface with SPSS so that clustering and additional plotting can be performed as shown in Figure 6.  A k-means clustering is performed with the number of clusters set to six based on a visual analysis of Figure 4.  Assignment of meaning to the clusters is done through filtering and analysis of the revision comments in Excel.

View A

View B

Cluster 1: Status-Quo.  Prototypical members include Estirabot and Guillermina whose edit activity is limited to reverting edits by others.

Cluster 6: Minor revisers.

Cluster 2: Defenders and clarifiers.  These members are content with the page and generally revert others or make small clarifications.

Cluster 4: Minor opponents and revisers.  Prototypical members include Kurrop and others who revert changes made by Paraiso defenders.

Cluster 3: Paraiso members and supporters.  Prototypical members include VictoriaV and Sara. These members are likely members or supporters of the Paraiso movement.

Cluster 5: Opponents. Prototypical members include Rm99 and Edemir.

Table 2: Clusters of revisers by point of view.



Figure 6: K-Means Clustering performed in SPSS with the number of clusters set to 6.
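For readers without SPSS, the k-means step can be sketched in plain Python.  This is a minimal version of Lloyd's algorithm, not the SPSS procedure the team ran; the function names, the two-dimensional sample points and the explicit initial centroids are assumptions for illustration.  The analysis itself used six clusters.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, initial=None, seed=0, max_iter=100):
    """Lloyd's algorithm: returns (labels, centroids).

    `initial` optionally fixes the starting centroids; otherwise k points
    are sampled with a seeded generator so runs are repeatable.
    """
    rng = random.Random(seed)
    centroids = list(initial) if initial is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical reviser coordinates forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, 2, initial=[(0.0, 0.0), (10.0, 10.0)])
```

As in the SPSS run, the number of clusters is supplied by the analyst; the algorithm does not choose it.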

Integrated Analysis

The findings of the mini-challenges provided critical information for developing the situational picture for the integrated analysis.  Although the matrices of data provided a common framework for numerical analyses a new data model was necessary to perform analysis across the heterogeneous datasets and to incorporate the learning from the mini-challenge work done with Improvise.  An associative network was selected due to its robustness for associating different types of data.

The hypotheses generated from the analysis of the Department of Health evacuation dataset using Improvise provided useful new data.  The answers to the “Traces-2,” “Traces-3” and “Traces-4” mini-challenges were modeled in the associative network.  Nodes were created to represent the hypotheses themselves, and these nodes were linked to the nodes representing the RFID values relevant to the hypotheses.  The implementation is shown in Figure 7.  Exporting the network in Pajek format allowed for the use of Pajek as an analytical tool in the process.

Figure 7: Modeling the evacuation mini-challenge hypotheses in an associative network.
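The export step can be illustrated with a short Python sketch that writes hypothesis and RFID nodes to Pajek's plain-text .net layout (a *Vertices section followed by an *Edges section).  It is not the WritePajekSemanticNetwork procedure itself; the function name and the node labels are assumptions chosen for illustration.

```python
def to_pajek(nodes, links):
    """Serialize an undirected network as Pajek .net text.

    nodes: list of unique labels; links: list of (label, label) pairs.
    Pajek numbers vertices from 1 and quotes their labels.
    """
    index = {label: number for number, label in enumerate(nodes, start=1)}
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{number} "{label}"' for label, number in index.items()]
    lines.append("*Edges")
    lines += [f"{index[a]} {index[b]}" for a, b in links]
    return "\n".join(lines) + "\n"

# Hypothetical labels: two hypothesis nodes linked to one RFID node.
nodes = ["Hypothesis: suspect", "Hypothesis: casualty", "RFID 56"]
links = [("Hypothesis: suspect", "RFID 56"), ("Hypothesis: casualty", "RFID 56")]
network_text = to_pajek(nodes, links)
```

Because hypotheses are ordinary vertices, they remain visibly distinct from the source data once the file is opened in Pajek.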

Figure 8: A path from RFID 21 to Telephone ID 5

Plotting the integrated associative network in Pajek allows the analyst to explore the relations between given entities.  The entities could be telephones, RFIDs, passenger rosters, hypotheses or other elements of the situation.  Links may represent various kinds of relations in the situation.  For example, surname “Katalanow” is associated with surname “Catalano” by virtue of the two names having a Levenshtein distance less than or equal to 2.  It was manually observed that “Katalanow” seemed to be like “Catalano” in some undefined way.  Using the Maple numerical framework and an associative network model it was then simple to encode this implicit knowledge explicitly as shown in Figure 9.  The Levenshtein threshold was used to define the relationship for this one case and then easily applied to all surname nodes in the dataset.

Figure 9: Instantiation of associations between surnames by Levenshtein similarity.
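The surname association can be sketched in Python as follows.  This is an illustration, not the team's Maple code: the function names are assumptions, and the threshold of 2 is taken from the text.  The first function is the standard dynamic-programming edit distance; the second applies the threshold to every pair of surnames.

```python
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def similar_surname_links(surnames, threshold=2):
    """Return sorted pairs of distinct surnames whose edit distance is within the threshold."""
    unique = sorted(set(surnames))
    return [(a, b) for a, b in combinations(unique, 2) if levenshtein(a, b) <= threshold]

# "Katalanow" -> "Catalano": substitute K with C, then delete the w.
distance = levenshtein("Katalanow", "Catalano")
surname_links = similar_surname_links(["Katalanow", "Catalano", "Vidro"])
```

The comparison here is case-sensitive, so the leading K and C count as one substitution.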

Just as an iterative process of exploration was used to expand the semantic network structure, an interactive process of exploration was used to explore paths within the network.  Figure 8 hints at a path between RFID 21 and RFID 62.  Further refinement of the plot allows for exploration of this path as shown in Figure 10.

Figure 10: Path from RFID 21 to RFID 62.
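Path exploration of this kind reduces to a shortest-path search over the associative network.  The Python sketch below uses breadth-first search on a small illustrative network; the node labels and links are assumptions modeled on the relationships described in this report, not an export of the actual network, and the team performed this step interactively in Pajek.

```python
from collections import deque

def undirected(links):
    """Build an adjacency map from (node, node) pairs."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    return adjacency

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Illustrative links only (hypothetical labels).
links = [
    ("Ramon Katalanow", "surname Katalanow"),
    ("surname Katalanow", "surname Catalano"),  # hypothesized alternative spelling
    ("surname Catalano", "Eduardo Catalano"),
    ("Eduardo Catalano", "Interdiction 2005-08-01"),
    ("Interdiction 2005-08-01", "Jesus Vidro"),
    ("Jesus Vidro", "surname Vidro"),
    ("surname Vidro", "Carlos Vidro"),
]
path = shortest_path(undirected(links), "Ramon Katalanow", "Carlos Vidro")
```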

This path highlights another relationship between the Catalano and Vidro families.  Carlos Vidro is not listed as a suspect in the evacuation dataset based on the RFID path behavior.  If Carlos is of the same Vidro family as Jesus or David or Pedro Vidro, then he may be involved in the bombing in some way.  The “I” nodes represent interdictions.  In this case the passenger relations indicate that members of the Vidro and Catalano families were passengers together on the same trips.  Here it was Eduardo Catalano and Jesus Vidro who traveled together on the three dates listed.

Adding the findings and hypotheses explicitly as nodes and links is beneficial in many ways.  The authority of the source data is preserved as new relationships are added, because the new relationships are labeled separately from the source data.  Figure 10 highlights this feature in showing that any conclusions derived from the path necessarily depend on the assumption that “Katalanow” is an alternative spelling of “Catalano.”  It is also possible to model conflicting hypotheses simultaneously and then to evaluate them in the larger context.  For example, RFID 56 is associated with Cleveland Jimenez, who is hypothesized to be both a suspect and a casualty as shown in Figure 11.  This proposition can be explored in more detail by tracing connected associations that may support or refute the idea.  This modeling approach also scales well, allowing for new hypotheses to be added over the course of the analysis without requiring a restructuring of the existing data model.  This versatility of the model is exemplified by the ability to find patterns of interactions across such diverse datasets.  Confidence measures can be included by assigning weights to the arcs in the graph, although that was not done in this case.

Figure 11: k-Neighbors within 4 of RFID 56.
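A k-neighbors selection like the one in Figure 11 can be approximated with a depth-limited breadth-first search.  The Python sketch below is an assumption about the operation, not Pajek's implementation; the function name and the chain of placeholder nodes are hypothetical.

```python
from collections import deque

def k_neighbors(adjacency, start, k):
    """Return {node: distance} for every node within k links of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == k:
            continue  # do not expand beyond the k-link radius
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

# Placeholder chain n0 - n1 - n2 - n3 - n4 - n5 (hypothetical nodes).
chain = {
    "n0": ["n1"],
    "n1": ["n0", "n2"],
    "n2": ["n1", "n3"],
    "n3": ["n2", "n4"],
    "n4": ["n3", "n5"],
    "n5": ["n4"],
}
within_four = k_neighbors(chain, "n0", 4)
```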

References

Brandes, U., & Lerner, J. (2008). Visual analysis of controversy in user-generated encyclopedias. Information Visualization, 7(1), 34-48.

Gov't unveils a Wikipedia for spies - Analysts can add and edit content on government's classified Web site. (2006, October 31). Retrieved July 10, 2008, from http://www.msnbc.msn.com/id/15503834/

Heuer, R.J., Jr. (1999). Strategies for Analytical Judgment: Transcending the Limits of Incomplete Information. In Psychology of Intelligence Analysis (pp. 31-50): Central Intelligence Agency.

Thomas, J.J., & Cook, K.A. (2005). The Science of Analytical Reasoning. In J.J. Thomas & K.A. Cook (Eds.), Illuminating the Path: The Research and Development Agenda for Visual Analytics (pp. 33-68). Los Alamitos, CA: IEEE Computer Society.